VSFS: A Versatile Searchable File System for HPC Analytics

Authors

  • Lei Xu
  • Ziling Huang
  • Hong Jiang
  • Lei Tian
  • David Swanson
Abstract

Emerging HPC analytics applications urgently demand file-search services to drastically reduce the scale of the input data in real time, so that the speed of computation and data analytics can be greatly accelerated. Unfortunately, the existing file-search solutions are either poorly scalable for large-scale systems, or lack a well-integrated interface to allow applications to easily use them for critical tasks. We believe that the time is ripe for the design of a searchable file system capable of accurate and scalable system-level file-search functionality. In this paper, we propose a Versatile Searchable File System, VSFS, which provides a transparent, accurate and real-time file-search service through a POSIX-compatible file system namespace that can be integrated into any HPC/Big Data legacy code without modifications. Additionally, to support real-time file search, VSFS uses a DRAM-based distributed architecture to perform real-time file indexing. Moreover, a versatile index scheme is designed to adapt to the various forms of HPC datasets. The results of our VSFS prototype evaluation show that VSFS is scalable in a typical HPC environment. It achieves significantly better file-indexing and file-search performance than the popular SQL/NoSQL solutions, while introducing only negligible I/O overhead. Finally, we integrate VSFS into a scientific analytics application to show its benefits in terms of performance and ease of use.
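
The abstract's idea of keeping a file index in memory and answering searches from it can be illustrated with a toy, single-process sketch. This is not VSFS's design or API: the class ToyFileIndex, its methods, the index name "energy", and the file paths are all hypothetical, and the sketch ignores distribution, persistence, and the POSIX namespace entirely. It only shows how a sorted in-memory index can answer a range query over one file attribute.

```python
import bisect
from collections import defaultdict


class ToyFileIndex:
    """Hypothetical in-memory index mapping named attributes to file paths.

    Illustration only; this is not the VSFS index scheme.
    """

    def __init__(self):
        # index name -> list of (value, path) pairs kept in sorted order
        self._indices = defaultdict(list)

    def insert(self, index_name, value, path):
        # insort keeps the list sorted, so lookups can use binary search
        bisect.insort(self._indices[index_name], (value, path))

    def search_range(self, index_name, low, high):
        """Return paths whose indexed value v satisfies low <= v <= high."""
        entries = self._indices.get(index_name, [])
        # ("" sorts before every non-empty path, so this finds the first v >= low)
        start = bisect.bisect_left(entries, (low, ""))
        result = []
        for value, path in entries[start:]:
            if value > high:
                break
            result.append(path)
        return result


index = ToyFileIndex()
index.insert("energy", 10.5, "/data/run1.h5")  # made-up paths and values
index.insert("energy", 3.2, "/data/run2.h5")
index.insert("energy", 7.9, "/data/run3.h5")
hits = index.search_range("energy", 5.0, 11.0)
```

Here an analytics job would read only the files in hits rather than scanning the whole dataset, which is the input-reduction effect the abstract describes.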

Similar resources

Providing Flexible File-Level Data Filtering for Big Data Analytics

The enormous volume of big data datasets imposes the need for effective data filtering techniques to accelerate the analytics process. We propose a Versatile Searchable File System, VSFS, which provides a transparent, flexible and near real-time file-level data filtering service by searching files directly through the file system. Therefore, big data analytics applications can transparently util...

Propeller: A Scalable Metadata Organization for A Versatile Searchable File System

The exponentially increasing amount of data in file systems has made it increasingly important for file systems to provide fast file-search services. The quality of the file-search services is significantly affected by the file-index overhead, the file-search responsiveness and the accuracy of search results. Unfortunately, the existing file-search solutions either are so poorly scalable that t...

Microsoft Word - EvaluationOfJava_ieeeformat_2.docx

Abstract: In the last few years, Java has gained popularity in processing “big data”, mostly with the Apache big data stack, a collection of open source frameworks dealing with abundant data, which includes several popular systems such as Hadoop, Hadoop Distributed File System (HDFS), and Spark. Efforts have been made to introduce Java to High Performance Computing (HPC) as well in the past, but were no...

HPC and Big Data Convergence for Extreme Heterogeneous Systems

As the data deluge grows ever greater, large-scale data analytics workloads are quickly becoming critical computational tools within the scientific community. Recently, convergence efforts have focused on combining aspects of HPC and “big data” analytics workloads using a unified supercomputing system. This has the opportunity to bring advanced analytical tools to scientists which enable ...

A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....

Publication date: 2013